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A phenomenological model hamiltonian to describe the folding of a protein with any given 
, sequence is proposed. The protein is thought of as a collection of pieces of helices; as a consequence 

' its configuration space increases with the number of secondary structure elements rather than with 

^\ I the number of residues. The hamiltonian presents both local (i.e. single helix, accounting for the 

stiffness of the chain) and non local (interactions between hydrophobically-charged helices) terms, 
and is expected to provide a first tool for studying the folding of real proteins. The partition function 
for a simplified, but by no means trivial, version of the model is calculated almost completely in an 
analytical way. The latter simplified model is also applied to the study of a synthetic protein, and 
' some preliminary results are shown. 

(N ■ 

I. INTRODUCTION 

o. 

^ , Protein folding is one of the most challenging problems in molecular biology, and many theoretical models have 
been proposed, aimed to explaiiuthe thermodynamics as well as the kinetics of this process. 

It is experimentally knownHij that a protein, under proper solvent and temperature conditions, folds from a 
denatured random-shaped state to its "native" state, which is characterized only by the amino acid sequence ("the 
' primary structure"), and does not depend on the initial state. |-| 
.. The experimental data for the folding process support the so-called "molten-globule" picturecl: folding of small 
O ' single-domained proteins in proper solvent conditions would start with a rapid collapse from a coil state to a compact 
, one, which is not unique, but is in metastable equilibrium with several other compact states. The protein would 

appear at this point as a molten globule, not presenting a definite shape. The latter diffuses among the various 
T-H , conformations until it finds its way across the free-energy barrier (probably unique) separating it from the most 
K*' ■ stable (native) state, and eventually would reach the native state. Because of the cooperativity of the process, folding 
appears from a thermodynamical point of view as an all-or-none transition, similar to a first-order phase transition 
in infinite systemsd. 

The understanding of the physics involved in protein folding has greatly profited byrideas and techniques coming 
from statistical mechanics of disordered systems and random energy models (R.E.M.)eI. As far as thermodynamics 
is concerned, the number and the complexity of interactions in which residues are involved, as well as the fact that 
^\ functionally similar proteins may have somewhat different sequences (for instance, Ivsozymes of different species), 
have suggested to approach the folding problem by means of an analogy with R.E.ME. Analytic resultsQ predict a 
^ , glass-transition when the probability distribution of the couplings is broad enough (that is, when the residues behaves 
H ■ very differently from one another). Several studies have been carried out on short model heteropolymers on a lattice 
' (both with random and with specified interactions), where a preliminary complete enumeration always allows us 
to find the energy of any configuration. These studies have revealed some important requiremerits that a sequence 
should fulfil in order to be a good folder (gap in the spectrum, non-degeneracy of the ground statea, particular values 
of the ratio between collapse and folding temperaturesE£l) . Debate is still open on the relative importance of these 
properties for characterizing good sequencesEJ. However, what the analysis has assessed is that these requirements 
. !^ I are not typical of random sequences, so that a true protein cannot be considered a random heteropolymer, as long as 
■ the feature of being able to fold to a stable and unique native state is concerned. n P n 

Lattice models have also been employed to study the kinetic aspects of the folding procesfO'tjM, under the 
hypothesis that they do not depend on microscopic details of the dynamics. One still gets thus meaningful results 
when using fictitious Monte Carlo dynamics instead of the true one (which, of course, cannot be easily implemented on 
a lattice). The above assumption seems reasonable, since the predictions of various diffusive regipies and relaxation 
times, that come out of these studies, are in good qualitative agreement with phenomenologylij. Besides, these 
works have revealed that kinetic accessibility of the native state from a generic initial condition is as important as 
the already mentioned thermodynamical requirements, for a sequence to be a good folder. As a consequence, it is 
commonly believed that real proteiiiS|rpresent a rough but funnel-shaped free-energy landscape, which provides an 



in 
o 



X3 
O 

o 



overall bias towards the native stateE§E3. 

However, the success of simple lattice models also states their limits. 
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The results [Ohtained so far constitute an increasing evidence of the fact that proteins are very pecuHar 
heteropolymersc3E_l: far from being random, they present an "energy-landscape" with correlated minima, and a 
native state which has probably been selected to be the ground state of the highest number of sequences (to face the 
risk of mutations) and to be kinetically accessible from all initial configurations of the chain. The role played by local 
contacs along the chain (the secondary structure), which had been disregarded by REM, is, now coming again strongly 
to attention, as it seems to be crucial for rapid folding and resistance against mutationals. 

Simple heteropolymer models on a lattice are not well suited to deal with this kind of features, and the only way 
to improve our knowledge within this scheme would be that of making simulations with longer chains: this is clearly 
unfeasible, as the only way to find the native state relies on a complete enumerations of all compact configurations of 
the polymer, whose number grows exponentially with the number of residues. 

The need for an exhaustive enumeration comes from the fact that chain connectedness is responsible for a strong 
frustration of a generic polymer, so that the energy landscape is rugged, and states of the same energy (but very 
different in shape) may be found anywhere, asking for a complete searching of the configuration space. 

Real proteins circumvent this problem when folding to the native state, since they are provided with a sequence 
which, given the geometrical constraints of connectedness and microscopic steric hindrance to be fulfilled, encodes 
the smallest frustratiniL and the smoothest energy landscape (essentially, this is the statement of the "principle of 
minimal frustration"l!lll3) 

The problem with lattice models comes from the fact that we do not know a-priori, given the constraints of lattice 
geometry (intrinsically different from the natural ones), which sequences of what hydrophobic charges correspond to 
the smoothest landscapes, and the only way to find it out consists in a exhaustive numerical analysis of the entire 
configuration space, since no definite hints can be provided by real proteins. 

The situation would greatly improve with a model directly related to the real systems, such that a mapping would 
exist between protein and model configurations. In this case, a direct comparison of the ground state with true native 
one would be possible, and one could check the goodness of the model by direct inspection. 

In this paper, we present a model which, in a coarse-grained way, allows us to deal with any chosen sequence of 
any lenght. Such model is based on a description of the protein chain in terms of pieces of helices, implying that the 
building blocks are indeed the elements of the protein secondary structure, which have an "internal energy" related 
to Ramachandran's maps and mutually interact according to the mean "charge" they contain. The hamiltonian we 
obtain is realistic, yet quite complicated, just because of its generality. However, simplified models can be extracted 
from it and studied independently, and the results can be compared to real native states. 

The pap er is organized as follows: in Sec. || we define the model, discussing the various terms in the hamiltonian, 
in Sec. [II we derive a simplified model and calculate its ground state and partial partition function; in Sec. IV, we 
briefly summarize and comment our results. 



II. THE MODEL 



A. Preliminary remarks 

We start from the observation that accurate studies of the phenomenology reveal a number of common features of 
the native states of the majority of simple, single-domain proteins: 

• the native state is organized hierarchically in secondary, supersecondary, and tertiary structures. Typical el- 
ements of the secondary structure are a~helices, /^-strands and tight-turns; supersecondary rules tell us how 
these elements pack together locally (prescribing, for instance, the right-handedness of /J-X-/3 units), while the 
tertiary structure refers to the way the above mentioned elements are arranged in space. As a general remark, 
one can say that |licieces of secondary structure that are adjacent in the sequence are also often in contact in 
three dimensions"E3; knots in the chain seem also to be generally forbidden ; 

• the native state is highly compact, with the non-polar residues buried on the inside, in order to minimize their 
contact with the solvent. The urge to protect the hydrophobic residues from water is believed to be the leading 
factor in the folding process: the secondary and supersecondary structures would emerge in order to accomodate 
in the best way the hydrophobic core, with the minimal frustration of local interactions; 

• the periodicity of the helices tends to mimick that of the sequence, when there is one. It is known, for instance, 
that a-helices on the surface of the protein's native state usually present an external side, exposed to the solvent, 
with polar residues, while the other one is hydrophobic; 
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• the partial success of structure identification methods, based on the analysis of the homologies between sequences, 
suggests that, even though the folding process is dominated by hydrophobic interactions, local properties pose 
serious constraints on the final structure, and somewhat limit the number of possible choices. 

For the above reasons, we aim to construct a model able to handle both the local and the global aspects of the 
main chain geometry (disregarding side-chain configurations). 

One way to cope with the two contrasting requirements of a coarse grained "effective" modelling of the interactions, 
and of a good control of the local constraints imposed by steric hindrance and chain connectedness, is to think of the 
protein as made up by pieces of different helices, linked together one after the other. This picture is general enough to 
describe probably all the relevant conformations of a protein: it is built having in mind the above-mentioned features 
of native states, but it may also represent chains in coil conformations, when many small helices are present, with 
random orientation. For the consistency of this approach, we shall approximate with a helix a part of the protein at 
least three peptide units long; shorter helices will not be allowed. Since a perfect helix involves repetition of a fixed 
dihedral angle (0, ijj) at each peptide imit, this approximation may appear to be somewhat crude when there is a 
strong local variability of the above variables, as it happens in loops and turns. However, this is not a major problem, 
because a tight turn can be fairly well represented by a short piece of regular helix, and the shortness implies that 
only a small error in the energy is introduced. The same holds true for any "coil" region of the chain, which may be 
partitioned and treated in the same way. 

The helices are described by their radius and pitc;li, their lenght, and the orientation of a reference frame attached 
to each of them, which specifies the direction of their axis. We shall see later that these are not the most useful 
variables to introduce in the hamiltonian, but we start with them for the sake of simplicity. 

The equation of the curve representing the protein chain is assumed to be: 

r{s) = J2bi{s)hi{s) , (1) 

i=l 

where s is a continuous variable parametrizing the curve points, and ranging from to A'', the total number of residues; 
bi{s) is the limit for A — > of the function 

bi{s,X) = -gi{s,X} + gi-i{s,X} , 

which represents a "barrier": gi{s, A) is a function of the variable s designed in such a way that in the limit A — > it 
becomes a step function. For instance one could choose 

ffils^A) = ^tanh(^-y^) , 

whereby 6i(si_i,0) =bi{si,Q) = 1/2, and 

g,{s,0) = ^{S,0) = Sis - Si) . 

The hj are the helices expressed in their reference frame (ei i, i, 63, i): 

hi(s) = tti [ (cos(ui(s - Si_i)) - 1) ei^i + sin(ui(s - Sj-i)) e2,i + 

Uihi{s- Si-i)e3^i]+hi_i{si-i) , (2) 

labeled so that helix i starts at Si_i and ends at Si, with sq = and sn,, = ^- is the total number of helices, 

residues are labeled from 1 to TV, and the convention holds that a residue sitting at the junction between two helices 
belongs to the first of them. We will name from now on rij = Sj — Sj_i the lenght of helix i. We choose 



(3) 



where L is the lenght of a peptide unit (the distance between two neighboring a-carbon atoms), so that the line 
element on each helix is hj ds = Lds. We assume the sign ct^ = ±1 of Ui positive for right-handed and negative for 

left-handed helices, while the product Uj/ij is always positive. We also ask that helices have the same lenght of the 

piece of chain they represent: this may be done by requiring that As = 1 when we move along the protein chain of 
one peptide unit: in this way the above defined Ui coincides with the number of residues in the secondary structure 
element that the helix describes. 
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We see that six scalar variables are needed to specify a helix: they are rii, ai, hi, and three Euler rotation angles 
relating the helix reference frame to the fixed "laboratory" one. For an infinite helix, the radius a^, the pitch hi and 
the angular parameter m are related to curvature Ki and torsion in a straightforward way: 



hi = — , u, = La,{K^ + )2 . (4) 



Ki 



Obviously curvature and torsion are the natural candidates to appear in the local part of the hamiltonian, which 
will take into account the stiffness of the chain and the steric hindrance of the residues. One is therefore led to study 
the form of curvature and torsion of the curve in Eq.(|l|). After some lengthy but straightforward manipulations, 
remembering that hi{si) = hi^i(si) and taking into account only the leading terms in the limit A — > 0, one finds that 
these quantities have the form of a sum involving curvatures and torsions of the various helices in the chain, plus 
some terms coming from the regions near each junction, depending only on the two neighboring helices. 

For instance we obtain for the curvature (see Appendix ^): 





s,^,)^is, - s) + Sis ~ s,) 8 ^ \.' (5) 



where dots indicate derivatives with respect to s, and S{s ~ Si) , i^(s — Si) are respectively Dirac delta and Heaviside 
theta functions. 

We see that the relevant quantity at the interface is the scalar product • h^+i between the right and the left limit 
in s = Si of the tangent vectors. Similar results hold for the torsion, as well as for any other quantity obtained by 
algebraic operations on curvature and torsion. 

The fact that it is possible to reduce the expressions of curvature and torsion of the whole chain to sums of the 
corresponding quantities for each helix, plus "interface terms" depending only on nearby elements, suggests that also 
the hamiltonian may be built as a sum of "local" single- helix terms with next-neighbour interactions, accounting 
for the stiffness of the chain. In addition to these, a third, non-local term, will describe the interactions between 
non-neighbouring helices. Therefore we write: 

i=l i<j=2 

The protein sequence will come into play in the last term, because the interaction between helices obviously depends 
on the residues they are made of, but will also have a role in the first one, as helices are preferred if they present the 
same local periodicity as the sequence. 

The explicit form of the hamiltonian will be discussed in section |ll C| ; in the next one, we introduce a formalism 
allowing us to treat conveniently the non-local term, which is awkward to handle in the variables that appear in Eqs. 
(i and d). 

B. Dynamical variables 

In order to specify the position and the kind of each helix, we introduce the following variables: 

Nh the total number of helices 

ni=Si-Si^i {n-i G lpi,p2]) 

h ^ \ [si + Si-i -I- 1) [k G [gi,i, q-2,i]) (6) 

Vi = h.i{si) - hj(si_i) 
B,^ i(h,(s,) + h,(s,_i)) 

where p2 — N ~ {Nh — l)pi and pi — 3, since a helix cannot be defined with less than three residues; qi^i = 
i[l +pi{2i— 1)], q2.i = N + ^[1 — pi{2{Nh — i) -t- 1)]. In the above equations rn is the lenght of the i-th helix expressed 
in residues; i G [1,7V/,,]; k represents the position along the sequence of the center of the i-th helix; Vj is the vector 
joining the end-points of helix h^; it is the geometrical analogue of rii; is the the spatial position of the middle 
point of Vi . 

Two other variables, related to curvature and torsion, are needed to completely specify the characteristics of a 



helix: a useful choice, which will allow us to write a realistic potential in a simple form (see Section II C 1 below), is 
to introduce: 
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Zi = , (7 

Ui 

Wi^Ui- 2TT'd{-Ui) , (8) 

where itj has been defined in Eq. Note that Zi ranges between and 1, while w,; e [0, 27r]. The definition of Wi 
allows us to remove the discontinuity between right and left-handed helices at u = ±7r, which is model-induced but 
inevitable in a description of the chain in term of helices. A chain in such a conformation, where there are exactly 
two residues per turn, may be regarded as both left-handed and right-handed, and a little deformation can bring to 
the one or the other type of configuration; yet there is no continuous operation that trasforms a right-handed helix 
into a left-handed one. This is reflected in the fact that one abruptly passes from u = tt to m = — tt while smoothly 
deforming the chain. 

The above variables specify how the chain is partitioned into helices and is embedded in three-dimensional space, 
providing a sequence-independent formalism. The sequence enters the model through new variables qk {k = I . . . N) 
and (/,?«). The former are related to the nature-of each residue fc, and measure its coupling to the other residues. 
In the following we shall refer to Li and coworkerfj who write the Mijazawa-Jernigan interaction matrixE^ as: 

Mp„ = + l-'.i{qp + qa) + l-^-iqpqa (p, cr = 1 ... 20) (9) 
(a slight change of notation is performed here with respect to the original paper). This equation can be recast as: 

Mp^^Qp + Q^ + ^{qp-q„f (10) 

with 

Qp = ^la|2 + ^llqp + {^l2|2)ql . (ii) 

The authors show that Qp correlates well with the hydrophobicity of the residues; for this reason we can call it "the 
hydrophobic charge" of aminoacid p. 

Since we deal with entire helices at a time, and not with single residues, we shall introduce the average g of a helix, 
centered in = Z, as 

J 2;;r+T Sj=-m if/ = 1,2,... 

y 2(2m+l) Z^j = -mW;-i+J + ^f + i+jj, " « - 2' 2' ' ' ' 

(integer or half-integer values of I are the only ones allowed for the central points of the helices, li] the variable m is 
an arbitrary number, comparable with the mean lenght of the helices). 

The corresponding values for the hydrophobic charge Q{1) are obtained from Eq. ( [Ill ) with q replaced by g. Notice 
that this is the correct way of evaluating Q{1), since it is easy to show that the average interaction between ni residues 
on a helix and n2 on another is given by: 



'tl 11-2 -| 

^^l?^2^^ "-i"2 
1=1 ]=i 



ni 712 \ Til 712 



nin2Ho + Ml I "-2 ^ qi +ni'^qj + M2 X! X! I^^i 

4=1 j=l I i=l j = l 



= A^o + Mi (91 +92) +^^29192 > 
and hence is naturally written as a function of the average qi and ^2 on the helices. 

The other variables are related to the local periodicity of the sequence, and are defined in the following way. 
Considering a generic helix of lenght 2n+ 1, centered at the point I along the sequence, the quantity: 

1 " 

p±{l,w,n) = —ji — ^ Qi+k {cos[w{l + k)]ei + sm[w{l + k)]e2) 

is the projection, on the plane perpendicular to the helix axis, of the "hydrophobic dipole moment" calculated at a 
point on the axis, and normalized with respect to the total charge X]j=-n Qi+j (^^^ the sake of simplicity, we take I 
to be an integer in these equations). 

We observe that reveals t he pre valence of non polar residues on one side of the helix, characterized by the 



periodicity w. Therefore, in Sec. II C 1 we shall write a local hamiltonian depending explicitly on 



5 



pi(Z,w,n) 



X] Qi+jQi+k cos((j - k)w) 



(13) 



j,k=—n 



and favouring configurations which maximize p^. 

The dinamical variables previously defined are not completely independent from each other, and the following 
constraints hold: 



1. the sum of the residues of all the helices must be equal to the total lenght of the chain: 

Nh 
i=l 



(14) 



2. the lenght of the end-to-end vector Vi is related to the lenght and shape of the helix: 



- \hi{si) - h,(s,_i)|2 = - n^L^ 



2 2\Sin^(fi'i) 



= 



(15) 



where Oi = niUi/2; 

3. the end of one helix must coincide with the beginning of the following one, both in sequence and in space: 



(16) 
(17) 



In these equations, i ranges from 1 to Nh, and, to be consistent with the definitions of k, we set Iq — 1/2, no = 0. 

From the above discussion the following picture emerges: we describe the geometric shape of the protein by the 
variables 'Bi,Vi,iUi, Zi,ni,li; besides, we give the sequence of "charges" Qk which represent its residues. 

Notice that the above variables do not specify completely the position of the helices in space, because, when the 
constrains are satisfied, there is still a complete degeneracy for rotations of each helix around the vector v.;. 

Actually, given a chain conformation, only one of these degenerate configurations, which are indistinguishable in 
our scheme, corresponds to it. Therefore, provided we choose a good criterion to identify helices out of real chain 
conformations, we have that any protein configuration can be mapped in exactly one model configuration. 

We may ask if the converse is also true: considering a given a model helix, specified by the variables n,w,z,v, with 
its end points pinned at Pi = B — v/2, P2 = B + v/2, we see that: 

a. many chain conformations, slightly different from one another, correspond to it, since real helices are not made 
up by an exact repetition of dihedral angles; anyway, we assume we can consider these to be represented by the 

same geometrical helix; 

b. it's quite unlikely that two (or more) real helices, with shape and lenght well described by the above variables, 
but corresponding to different degenerate positions around vector v, may exist; 

c. it could be possible, on the other hand, that no real helix with the above parameters could fit between Pi and 
P2, due to the stiffness of the chain at those points. 

If case 'c' holds, troubles arise, since forbidden chain configurations could appear as allowed model ones. 

In the following, we shall make the assumption that, given a model helix (of lenght n > 3), with its end-points at 
Pi and P2 , it is always possible to replace it with the corresponding real chain helix, in such a way that no relevant 
perturbation is introduced in the total energy of the protein. 

This is reasonable, since small adjustements of the chain, out of a perfectly regular configuration (yet not affecting 
its overall helical shape), can intervene and prevent forbidden joint conformations, thus reducing the energetic penalty 
to a small fraction of the total energy. We shall come back to the discussion of the energy contributions from the 
helix junctions in the next section, where we discuss the explicit form of the hamiltonian. 
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C. The Hamiltonian 



1. The local hamiltonians Hi 



Two terms will contribute to the internal energy of a helix: the first one, H^, is purely geometric, and takes into 
account the experimental Ramachandran plot to dictate which kind of helices are more likely to be formed. The 
second one, iJ/, on which we commented above, is sequence dependent, and acts as an external field, biasing the 
helices towards a certain periodicity. 

It is important to notice that, if proteins were made by a sequence of exact repetitions of the same (0, %p) angles, the 
association of a geometrical helix to each part of the chain would be straightforward and the geometrical quantities 
w, z of the helix could be written as functions of (0, ip): 



cos(f) = cr(s2)c((/),'(/') , 

\S2\ (18) 



2 . 
Z 



where we have defined: 

a{x) 

and use the explicit expressions 



+1, ifx>0 
-1, if.T<0 



c(0,i/') = asm(^— ) + 6sm(— ^) , 

S2 = ccos( — - — ) + dcos( — - — ) , 

(see Appendix^ for details). The mapping {(ff^ip) ^ i'^^z) in the above equations is two-to-one, as can be seen from 
the study of the solutions of the fourth degree polynomial equations involved in the inversion of Eqs.(^8|). This reflects 
the fact that ((/>, ip) provide a complete description of the geometry, specifying not only the position of the Cq atoms, 
but also the orientation of the peptide planes, which have been disregarded in our approach. 

Dealing with real proteins implies that each helix is associated to a portion of chain where a certain amount of 
irregularity in ((/), ^) is inevitable. This has no practical consequences when the irregulaties are small and the dihedral 
angles are clustered around a particular position (which happens for long elements of secondary structure). However, 
for short coil regions with a great variability in the dihedrals, it would be quite artificial to relate the best fitting 
values of w, z to an hypothetical (</',■!/') repeating couple. It is clear that in the latter case the relationship between 
{w, z) and (0, "0) is weakened. For this reason we shall make the simplifying assumption that w, z can be assumed 
as fundamental variables, and that the helix-model configurations are in a one-to-one relationship with them. Since 
(w, 2:)-couples that cannot correspond to any real protein configuration are introduced by this ansatz, we shall write 
the hamiltonian in such a way that these values have a vanishing weight in the evaluation of the partition function. 

The choice of an explicit expression for the hamiltonian requires a careful analysis, because no reliable potential 
function based on first principles is known. This is not surprising, if one thinks that each residue is itself a many-body 
system, usually in interaction with the solvent molecules. On the other hand, the fact that even a crude hard-sphere 
model for a dipcptidc reproduces the experimental Ramachandran's maps in an essentially correct way (compare 
for instance Fig. (12A, 13A) in Ramachandran's articlecS with Fig. (5) in Morris et a/.c3), implies that a simplified 
description of the potential function should be possible, and suggests its main characteristics. 

We already mentioned the fact that, even in the case a perfect helical-shaped chain, the variables w, z give a 
description of the chain geometry which is less detailed than that provided by (0, ■0) angles, and even more so in 
comparison with an all-atom description. Despite this, it is possible to write a potential function in the variables w, 
z which correctly reproduces the main features of the Ramachandran's plots, in the form 

= - 1)70 ci ((wi - 02)^ - ca)^ + C4 + C5 (zi - cg -I- C7(wi - cs)^)^ . (19) 

Here the c^'s are fixed adimensional constants; the factor (n^ — 1) takes into account the feature that each residue 
in the helix feels the same potential, except the last one, which corresponds to the junction with the following helix, 
and must be treated in a different way. In Figs.(Q,H) we plotted the contour lines of this potential, with — 1 = 1, 
respectively as a function of w, z, and of their images on the (</>, "0) plane. 
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Note that, since h/a « 1/20, there is nearly symmetry under the exchange of (p and ij), so that there is an intrinsic 
difficulty in distinguishing the region above from that below the diagonal 4> = tphy means of the quantities at the right 
member of Eq.(|l^). This can be considered a minor problem, since the physically allowed region roughly coincides 
with that above the diagonal (j) = ijj and we aim to study, as a first approach, only the behaviour of a system which 
presents two energy minima roughly corresponding to the a- and /3-regions of Ramachandran's map (as a consequence, 
we disregard left-handed a-helices, since a residue in that position usually belongs to a turn, and not to an actual 
left-handed helix). 

The above ad hoc hamiltonian provides a correct qualitative description of the phenomenological results, without 
introducing expressions more complicated than a fourth degree polynomial. In the following we shall consider it as 
an unperturbed hamiltonian, to which the non local-interactions add as perturbations, which remove its degeneracy 
without affecting its overall shape in plane 

Coming to the sequence dependent part of the local hamiltonian, a very natural choice is that of taking 

Hi ^ -jin,P{k,w^) , (20) 

where 71 is an appropriate dimensional constant, and P{li,Wi) = T{'p\{li,Wi,n)) is some simple function of 
{li ,Wi,n). For instance could be either the average of p5_ calculated for different accessible lenghts of the 



helices (the values of n in Eq. (^^), or pj^ itself, evaluated with a particular phenomenological mean value of 



n. 



A more detailed study on the best expression for is left to future work on the subject: in the following, we shall 



choose a particular form for J- only when, in section HI , we shall study the ground state of hamiltonian Eq. (EOf) for 



a small synthetic protein, showing that P{li,Wi) can indeed provide partial information about the native state. 



2. The interaction between neighbouring helices H, 



So far we have disregarded the contributions to the energy coming from the aminoacids at the junctions between 
helices: they have been kept out from Eq.(|l^) thanks to the factor (n^ — 1). When we try to keep them into account, 
we immediately face many difficulties, since there is apparently no natural way to relate exactly the microscopic 
potential determining the possible (0, V') values to the description of the chain in terms of helices. 

We saw in Eq.(||) that, at the junctions between helices, the curvature depends on the scalar product of the right 
and left limits of tangent vectors in those points. This would suggest to write an interaction penalizing discontinuities 
in the tangent vector; yet, this is quite awkward for the following reasons: 



while it is possible to relate curvature and torsion of a helix to the dihedral angles specifing the chain, it is 
extremely complex to do the same for the scalar product • h^+i: the knowledge of (</>, "0) at the the junction 
is not sufficient to specify the direction of the tangent vectors (remember that the Cq atoms of the chain do not 
even lie on the helices, due to the requirement that lenghts be the same when measured along the (continuous) 
helices or the (discrete) chain); 

even if it were possible to find a mapping, relating the dihedral angles at the junction to the helix geometrical 
description, in terms of tangent vectors, uncontrollable mistakes would be done in evaluating the energy. In 
fact, real helix junctions are different from ideal ones, since structural adjustments are allowed in real chains to 
minimize the energetic cost of the junction which cannot be described by ideal chain geometry, where helices 
are stiff. Hence, if we are interested in a reasonable estimate of the energy, a detailed description of the tangent 
vectors' dependence on the (</>,'!/') angles at the junction is essentially useless. 

Finally, tangent vectors are very d ifficu lt to write down within the adopted formalism (because of the degeneracy 



under rotations, discussed in Sec. 11 B, even if we don't relate them to real chain quantities. 



Therefore, we have to write down an expression for the interaction energy between neighbouring helices without 
relating it to the real chain geometry, and without resorting to tangent vectors. The natural candidates to appear in 
such an expression are, of course, the vectors v.;, v^+i, but the functional form to be chosen is by no means obvious. 

In principle one could study first the total energy of two successive secondary structure elements (for instance, by 
looking at dihedral angles and applying microscopic potentials) as it comes out from phenomenology, then represent 
such elements with model helices, and finally get the interaction energy as the difference between the total and the 
sum of the single helices'. It would be possible, in this way, to relate the interaction energy to the relative positions 
of Wi and v^+i. 
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Yet, as a first approach, we assume this energy to be a costant, at each junction, regardless of the values of the 
dynamical variables: for instance, we can take the mean energy of the residues calculated with the Ramachandran's 
map distribution. Since Nh — 1 is the number of junctions, we set: 

Hnn=l2{Nh-l) . (21) 

This simple hypothesis entails an important effect: if one neglects the sequence-dependent hamiltonian Hi and the 
non local interactions, splitting up a helix in two pieces with the same {w, z) as the former one involves the substitution 
of a residue of energy Hf{w, z) with a residue of mean energy 72. Hence helix breaking will be penalized for helices 
with "good" values of {w,z) (a-helices and /3-strands, for instance), and favoured in the opposite case. 

This is very important, because both entropic effects and non-local interactions, as we shall see below, would favour 
configurations with many short helices, regardless of [w, z) values: i?,m competes with the above effects, allowing, in 
principle, the existence of equilibrium states of the model presenting long elements of secondary structure. 

3. The non-local interactions Hij 

The modeling of non-local interactions requires a careful analysis of their nature and characteristics. Two different 
contributions are to be dealt with: hydrogen bonds and hydrophobic interactions. The former are responsible of 
orientational preferences of the couplings between elements of the secondary structure, but are believed to play 
an insignificant role as a driving force for folding, since hydrogen bonds to solvent are of the same energy than 
intramolecular ones. The latter are responsible for the collapse of the chain to globular states, but are very difficult 
to model, since they come mostly from entropic effects invoving the solvent, and not from a true coupling between 
residues. For these reasons, we abandon the idea of writing non-local interactions on the grounds of microscopic 
considerations, and once more resort to phenomenology. 

First of all we notice that typical distances between the axes-of interacting secondary structure elements in ±he 
native state range from 0.46 nm (two hydrogen-bonded /3-strandscj) to about 1 nm (two a-helices or two /?-sheets,E3). 
Therefore we simply write down an attractive square- well potential in the variable ABij = |Bi — Bj|, taking as 
reprentative of the i-th helix, with a hard core repulsion preventing overlap between helices. 

Then, we look at the phenomenology of non-local interactions in the native state, considering at first only the 
hydrophobic effect and disregarding hydrogen bonds in /3-sheets. We see that usually two elements of secondary 
structure tend to pack as closely as possible, just due to the hydrophobic effect. For geometrical reasons this usually 
means that they cannot be parallel; thus the number of residues which are actually into contact jis independent of 
the lenght of the helices, and also, roughly, of their characteristics. If, following Li and coworker^j, we take as the 
"microscopic" contact interaction between two residues that given in Eq. we can write for the interaction between 
hehces: 

H^3 = Hpi - ABy>9(ASy - po) [73X (a^o + Ml iq{k) + + f^^qikM,))] + 

+74i?(po- AS,,) , (22) 

where q{li) are the quantities defined in Eq.(^2|); x is the average number of contacts between residues in two close- 
packed elements of the secondary structure, 74 3> provides an hard-core repulsion when the distance is less than 
Pq] Pi is the range of the attractive interaction 73 > "normalizes" the interaction with respect to the other terms in 
the hamiltonian: again, it should be small compared to 70. 

Coming to the hydrogen bonds in a /3-sheet, we see that they tend to align the two interacting strands, independently 
of their charge. Yet, it is known that /3-sheets show very little stability when exposed to the solvent, because residues 
easily form hydrogen bonds with water. In our approach, where the solvent is taken into account implicitly in 
the coupling strenght, one should relate hydrogen-mediated interactions to the geometry of the helix and to the 
surrounding environment (in order to distinguish between exposed and buried sheets). Hence, hydrogen bonds between 
/3-strands should depend on the overall hydrophobic charge of the environment they are embedded in: this is far too 
complex to be described exactly. 

For the sake of simplicity, we shall use Eq.(p^) also to describe interaction between /3-strands, neglecting the ten- 
dency towards alignment. A more detailed representation would involve the introduction of terms involving • Vj, 
and also, perhaps, of Vi_i A v^+i • v^, to account for right-handedness of super-secondary structures like /3-X-/3. 

As a result of the above discussion, the complete hamiltonian of our model, also including the constraints, reads: 

Nh Nh Nh 

H = i/„„ + ^(i/° + Hl)+ J2 H^,, + V" + Y.{Vl + VU., + Vti,,) , (23) 

i=l i<j=2 i=l 
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where, recalling here all the results for the sake of clearness: 



l2{Nh-l) , 

(rii - 1)70 ci {{wi - - cs)^ + C4 + C5 [zi - cg + c-!{wi - cs)^)' 
d{pi - AB,y )i9(AB,y - po) [73X (mo + A^i (^('O + Wj)) + M25(«iM;j))] 

V° = Ao(^n.-iV) , 



2 r 2 2\ 



V:, 



•^2,i(Bi — Bi_i — — — — ^) 



The Aa,i's are Lagrange multipliers that allow us to insert the appropriate constraints in the hamiltonian. 

Notice that we have introduced an approximated expression of Vl (compare it with Eq.(p^)): this is possible 
because, for the most significative regions of (w,z)-plane, with any allowed value of rii (remember that rij > 3), the 
M-dependent term in Eq.(|l^) is negligible. 

Without loss of generality, we can moreover take C4 = 0: in fact, C4 can always be eliminated by the transformation: 



Ao = Aq - 70C4 



72 = 72 + 70C4 , 



whereby the hamiltonian changes of the constant term (N — 1)7004. The way we have chosen to implement the 
constraints is particularly suitable to carry on some analytic calculations on the model. Obviously, it is not the only 
possible one, and different choices may be useful in different approaches. 



An important remark concerns -/?„„: the choice of an expression independent of Vj introduces possible symmetries 
in the model that could be exploited to some extent. If, in fact, we disregard the sequence, taking = g as a 
constant and neglecting H^, we see that, given a set of Bi, and a choice of n^, Wi, Zi, providing a total internal energy 
E'^ = X^il^'i we can certainly change the values of n^, Wi, Zi, together with the vectors v^, in such a way that 
is unchanged and the Bi fixed, so that also the non-local interaction remains the same. This symmetry is reduced, or 
removed, by the introduction of the real charges qk, but this has to be studied independently in each case. 

The picture that emerges from the above hamiltonian is that of a complicated interplay among dynamical variables: 
a helix of a certain shape and lenght, specified by w,z,n, will be attributed an energy based on H'^ (which is related 
to dihedral angles conformation), plus a sequence-dependent contribution depending on its position I along the 
chain. The values of z and n then determine, through constraint V"'^, the lenght of the vector v, and through V2, the 
spatial coordinates B of the helix. The latter, in turn, determine whether the helix considered interacts with other 
helices by the term Hij, where the charges q(l) again depend on the internal coordinate 

For the above reasons, the hamiltonian Eq.(|25|) is inevitably complicated. It should be remembered, though, that 
it describes a generic protein, with any sequence, and within a very realistic framework, which deals directly with 
secondary structure elements. Moreover, the model involves a reduction of the intervening number of independent 
dynamical variables (5iV^ against N couples ((/), ?/;), if A'^ is the number of residues, with a rough estimate for the ratio 
as 5Nh/2N « 1/3), so the shortest proteins could lie within the reach of numerical studies. 

In this case predictions of the model can be compared directly with experimental findings, which could remove the 
need of an exhaustive search for the ground state, that is the starting point of many lattice models currently studied. 
Quantities like: 

i—l 

measuring the distance of equilibrium structure from the experimental native state (the average is taken e.g. in a 
canonical ensemble), as well as its analogue referring to line coordinates: 
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i—l 

can give information on the most appropriate choice of the parameters in the hamihonian, and on the folding transition. 

Another important topic to be investigated is that of the identification of order parameters, which could characterize 
the folding transition in an intrinsic way, with no reference to a native state known a priori. The study of the 
temperature dependence of correlation functions could possibly distinguish a true folding transition of a "good" 
sequence from the freezing of a "bad" one into any minima of a rugged landscape. 

In order to get information about such thermodynamical quantities, we must be able to evaluate the partition 
function associated to the protein hamiltonian H. This is of course a very complicated task, and deserves a complete 
and specific analysis which is beyond the scopes of the present paper. In the next section we shall show, anyway, 
that one can resort to the study of simplified models and get indeed useful information both on the protein under 
investigation, and on the best way to extend the analysis to the more general case of the complete model. 



III. A SIMPLIFIED MODEL 



In this section, we shall mainly deal with a simplified version of the model, where only the local terms iJf + Hi are 
kept into account and the number of helices is fixed. Eventually wc shall add non-local interactions to it as a small 
perturbation, in order to retrieve information about the spatial conformation of the protein. 

When dealing with the local model, in fact, we loose the description of the spatial structure, but we are left with 
a highly non-trivial model, with the sequence coming into play through H^, which provides interesting information 
about both the native state and the relative importance of the interactions stabilizing it. Indeed, the requirement of 
maximizing the separation of hydrophobic charges on the helices generates a scenario in which the most anphiphilic 
helices tend to compete with each other in order to grow as long as they can. The equilibrium configuration one 
finds in this way specifies how the protein should be partitioned in secondary structure elements to obtain the 
highest anphiphilicity. Therefore, it provides some important insight on the secondary-structure composition of the 
native state, so that it is natural to ask oneself if, at least in some cases, the three-dimensional structure could be 
superimposed to the resulting secondary one, introducing non-local interactions as a small perturbation driving the 
helices to the correct configuration. 

In the first part of the, present section we indeed siiow that, for the simple synthetic protein (already studied by 
Kolinski and coworkcrsEa and Raleigh and DeGradoB) specified by the sequence GEVEELLKKFKELWKG PRR 
GEIEELFKKFKELIKG PRR GEVEELLKKFKELWKG PRR GEIEELFKKFKELIKG, the above scheme may be 
successfully applied. We shall in fact find the set of n^, Wi corresponding to the minimum of the local hamiltonian; then, 
we shall switch on the non-local interaction as a small perturbation and eventually find that native-like configurations 
are indeed the ground state. Encouraged by this result, we shall pursue the study of the local model and, resorting 
to some general assumptions on P{li, uui) and to suitable approximations, we shall find out - in an almost completely 
analytical way - an expression for the partition function of a generic protein, which can represent a good starting 
point to study the thermodynamics of the complete model. 

Both these investigations are intended as preliminary tests, aiming to demonstrate on the one hand that the variables 
we have chosen accurately describe the sequence, and on the other that the model we propose, despite its complexity, 
is indeed analytically manageable, at least in some simplified case. 

We start with the hamiltonian: 

Nh 

Hi{w,},{n,},{h}) = Y.{H°iw,,n,) + Hliw,,h,n,)) . (24) 
where the constraints are exactly implemented, through the equations: 

h = i(ni + l) + El=\'^fc (r,r. 

We assume that the function P{li,Wi), appearing in the expression of Hf (Eq. (|20|)), has the form: 

pn ^„ ^ _ j P±{h,w„i), if is an integer , , 

^ ~ U -^,w„3)+pl{k + ^,w,,3)], if = fc + i, for integer k ^^^> 
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We choose n = 3 in expression (|3|) since this involves calculating the hydrophobic dipole on an helix of seven residues, 
a reasonable lenght both for a-helices and for /3-strands. We assume that 71/70 1, so that we can approximate 
the minima of H with those of Hf, as far as Wi is concerned. In this way we are left with only two values for the Wi, 
namely woa, wop, which in turn entails that we can forget about H^, whose minima are symmetric, and only deal 
with Hi . 

Since the n.i are integers and the li are integers or half- integers, we are left with a discrete configuration space, 
whose size depends on the number of helices we are considering. 

Obviously, we have 2^'* configurations for the set of Wi, but, once given the set of positions li, one "a-priori" knows 
which has the lowest energy, by a direct comparison of P{li, woa) and P{li,woi3) (see Fig.(|3|)). Hence, we have to find 
the energy minimum in a space that contains as many points as the number of possible partitions of N residues in 
Nfi helices which have a minimum lenght of pi residues. One can easily convince oneself that this number is given by: 



J=0 



For the protein considered we a priori know tlpjt its native state is made up of four a-helices (indeed the sequence 
has been designed to produce a 4-hclix-bundlcE3) , so we try Nh = 4 and ask ourselves if our model will be able to 
find out the correct position, lenght and kind of helices. 

We set 71 = 1, C4 = and exhaustively searched the configuration space with TV = 73 and Nh = 4, which contains 
TT{Nh) = 41664 points. We found that (77.1,712,713,77.4) = (17,19,20,17), with all the helices being a-helices, is the 
ground state of our semplified model. This result is in excellent agreement with the experiment, and suggests that, 
at least for some proteins, the only requirement of maximal local anphiphilicity may be enough to get the correct 
composition of the secondary structure. 

Encouraged by this result, we switch on the non-local interactions Eq. ( p2| ) as a small perturbation (73 <C 71) to 
the simplified model ([2^), and try to predict the tertiary structure. We procede as follows: relying on the fact that 
the gap between the ground-state and the first "excited state" of our simplified model is necessarily finite (the ni 
may only assume integer values), we freeze the secondary structure (711,712,713,714) we have obtained and look for 
the minimum of Hint = X^t^i"^ "llfj^i+i ^ij^ where Hij is given by Eq.(22). We choose in that equation the values 
po = 5 A, pi = 9 A, and show that the bundle-like configurations are indeed the ground state (even if degenerate) 
of the model. First of all we notice that, once fixed the length in residues rii of each helix and its Zi according to 
the fact they are all a-helices, we have a unique set of Vi, coming from Eq. (|l|). Then, we see that, upon defining 
di = Bi+i — Bi, the set of twelve cartesian components of the four vectors are subjected to the thirteen equations 
(see Eq. (^): 

d, = i(v,+v,+i) (j- 1,2,3) (28) 

v^^v^ (7 = 1,2,3,4) (29) 



One of the above equations represents a constraint on the allowed d^: it is easy to see that Eqs. (^9|) lead to the 
explicit expression: 

vl -vl+ Ml - 8d2d3 -I- ^^-^ f di A d2) X 



di A d; 



(d3 A di) A2 - (da A + aodsJlGv^ (di A d2) - (diA2 + d2Ai 



0, (30) 



where we have written d^ = didi {di, i — 1,2,3 are unit vectors) and we have defined Ai ~ {v^ — vl + 4:df)/di, 
^2 = {vl — V2 — 4^2)7^2 and (To — sgn{di A d2 • Vi). 

The above constraint, together with the requirement that all the distances among the helices range between po and 
pi, defines the ground state configuration of our protein. 

It is straightforward to see that Eq. ( |30| ) has solutions on the plane: for instance upon choosing d3 = — di, we see 
that any configuration satisfying: 

p2 + i I „2 _ 7;| \<dl+dl<pl-^\ vl ~ vl I (31) 
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is a solution, and indeed a bundle-shaped one, because of the geometrical conditions we have imposed. These solutions 
share the same topology and present no barrier in between, since the corresponding domain of (di, 1^2) plane is simply 
connected (Eq. (|3l|)). For these reasons they can be considered as small deformations of the same "native state". Of 
course solutions exist which are not planar, but they are difhcult to single out analytically. Hence, we resorted to 
numerical calculations to find some of the energy minima, which, due to the oversimplified, square well interaction 
potential Eq. (p2|), are highly degenerate. First, we characterized the conformations of the protein by spherical 
coordinates Vi = {vi, 9i, (jji) calculated with respect to vector vi. Since the lenghts are known, and we can set (j)2 — 0, 
we were left with the five angular coordinates 62, O3, 64, ^3, 4>ij to describe any configuration. We also calculated, for 
each configuration, the "lack of planarity", as given by the volume F = di A d2 • ds, and the distance (3bun from the 
"most bundle-like" conformation(the one with the highest degree of parallelism/antiparallelism between vectors): 



^hun 



1 (cos(aij) -cos(q°j-))^ 

6,^2 (cos(a-)-cos(aJ^.))' 



(32) 



Here a^j is the angle between vectors Vj and Vj, and o;^^-, a^- are its best and worst values, in relation with the bundle 
configuration. 

We performed a random search in configuration space and found nearly 10* ground state configurations; then we 
excluded those presenting a residual steric hyndrance among vectors vi, V3, and V2, V4, since the condition AB^ > po 
on the central points of helices does not prevent their other points to come too near. 

Afterwards we grouped the remaining structures, identifying those globally differing less than \A. At the end of 
this process, we were left with only 4 % of the original configurations, which we sorted according to their Phun values. 
We found that configurations with the highest degree of parallelism among the v^'s are also those with the most 
planar set of vectors di. The best and the worst structure are presented in Fig. (Q) and Fig. (^), respectively: they 
correspond to the values f3bun = 0.031, V = 0.072 and (ihun = 0.32, V = 0.64 respectively. Notice that, even though 
our model does not single out a unique native state for the proposed protein, it is nevertheless very encouraging that 
its results can be easily pruned, according to some simple considerations of excluded volume, leading to essentially 
correct configurations. This, in spite of the many simplifying assumptions made. 



Having investigated the characteristics of the ground state of the simplified model, we now come to the study of its 
thermodynamic properties, and write, under certain simplifying assumptions, its partition function. 
We consider again the hamiltonian: 

H = V'' + + Y,iH^ + . (33) 

1=1 

and perform some further modeling on the explicit expression of H^, under the assumption that Eq.([20[) contains 
more more details than needed. To this purpose we keep into account only the most relevant maxima in P{li,Wi), 
whose positions we specify by (^o,!^, Wi,). 

The general requirements that has to fulfil are then the following: 

• along the w-axis, it must present minima which are simmetrically disposed around vr, since is invariant under 
the exchange w ^ 27r — w (corresponding to the inversion of handedness of the helix) ; 

• in the physically interesting domain, li e [<Zi.i, q2,i] and Wi € [— 7r/3, 57r/3] (which is the image of Ramachandran's 
plane ((/>, ip) ), Hi must amount to a perturbation of iJ^", hence it must be bounded within a range small compared 
to 7oCic|, namely to the difference between the maximum and the two (symmetric) minima of Hf; 

• the width, shape and depth of the minima of must resemble those of P. 

In the following, we shall assume that, near positions lo,u along the sequence, only two minima in w are present, 
and they are placed at Qi^^ — Wy and ^2,v — ^ir ~ Wi,- We choose for H} the expression: 

M 2 

Hi = -n,Y,Y.^i5-^^,^',a , (34) 

1^—1 a—l 

where we have defined: 
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Gi^i^^a = exp{-[ 1 } (35) 

H\ amounts to a collection of 2M. gaussian wells, whose overlap will be considered negligible for all practical 
purposes, which have a depth 7i(5^ (with 71 ^ 7oCic|), are centered at positions Zo,i/ along the Z-axis and positions 
^a,v along the w-axis, and have widths We also assume that, at the end points of the chain, all the 

gaussians Eq.(p5|) are negligible, which will allow us to extend the integrations to the range li G [—00, 00]. 

Notice that, while 71 is a tunable parameter, S^, are given (sequence dependent) positive constants, corresponding 
to the ratio between the depth of the ly-th minimum and that of the deepest one. Without loss of generality we can 
take < 1. 

Expression (|34[), ( p5[ ) for the hamiltonian relies on the implicit assumption that the height, position and width of the 
maxima of P{l,w) are more important than the details of its shape, so that a coarse grained description is sufhcient. 
Of course, the use of gaussians is arbitrary, and is dictaded by the fact that they allow us to model accurately the 
shape of P around its maxima, and decrease rapidly to zero, preventing or reducing spurious overlaps. 

In order to perform the calculations, it is useful to write the constraint V3 appearing in Eq. (^3|) as: 

^hY^ik-Uf , (36) 

1=1 

where we have introduced, remembering the definition of If. 

Li = i {si + Si+i + 1) = ^'^^ + ^ . (37) 



fc=i 



Our goal is to evaluate the partition function, so we write 



Y[e-P(H"+Hi) , (38) 



i=l 

and introduce two approximations, in order to make an analytic integration possible. Namely we replace: 
exp {-(3{n, - l)7oCi[(wi - €2)^ - cg]^} ^ 

-/°-exp{-/^(n.-l)i^^:i-i^}+exp{-/^^^^ , (39) 

where 

Ml = C2 - , = C2 + , = . (40) 

470C1C3 



We also write: 



M 



-0"' =fl^l + ^^exp{-/3e.(n.0[ ^^' ~ 1""^' + ~ ^'^^ ]} , (41) 



2 



1^—1 a—1 



where 



e.(n,) = n»7i'^. gg„,^^A7 _i ' (42) 

Notice that in this case both e^^^^ and fl tend to one as we move far from the maxima: nevertheless, this fact 
does not bring any further difficulty in the integrations involved in the partition function, since the dependence of 
H on Wi is dominated by Hf, and that on li (which anyway ranges between the two finite values qi^i and (?2,Ar^, 
see Eq.(||)) by V^. With the above approximations, which are thoroughly discussed in Appendix ^ we are finally 
able to calculate the partition function. First of all we integrate on Wi and Zi, performing the change of variable 
Ci = Zi — cq + c-j(wi — cs)^, which does not affect the jacobian: 
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Z{{n},{l}) 



- p-/3(V«+V^) 



n 



dwdU-fleM-Ploc^{n, - 1)C-] 



We insert Eqs.(|39|, ^ ) in the above expression and perform the integrations on C,i, Wi and then the summations on 
li, which we extend to thcj'ange [—00, 00]. EventuaUy, after some lenghty calculations, and resorting to the definition 
of elliptic theta functionscJ, we can write: 



Z{{n}) 



Nh 



n 



2cr7r 



M 2 



{Aa,i.Ani)-C^{{nk}M)) . 



u=l a,fc=l 



A3) 



a:i/(ni,A3) V A3) 

where 9^{z, q) is the elliptic 9^ function of argument z and nome q, and we have defined: 

1 

,ni-l e^{ni). 



(43) 



{rii - l)7oC5(- 



^a,&,!^('T-j:) 



i'^ ■ 



A3) = - — ^/W^Ani) + \3r]^ , 



y^i{nk},X3) 



2\hM 2 ^ A3Li({rifc|)) 



ei/(rii) + A3772 



a({nfc}, A3) = /3^4^r^(^o.. - i.({n4))' , 

e^(ni) + A377^ 

At this point we should perform the sums over n^. They are clearly unfeasible in analytical way, yet, they are very 
simple to perform numerically. To this end it may be useful to implement directly the constraint Vq by setting 

fc=l 

and paying the necessary attention to discard the collection of n^'s leading to negative values for njVh- 

In this way, for each given sequence one can obtain an expression for the partition function which depends only on 
/3, A3, and the ratio 71/70- The Lagrange multiplier must be evaluated by minimizing the free energy. This involves 
the condition 



dZ 



, 



that should be solved to give A3 = A3 (/3, 71/70). 

We shall then be able to study the thermodynamic behaviour of the system at different temperatures and values 
of 71/70- We leave this detailed analysis to future studies, since our goal here was only two show that, despite the 
complexity of the model, interesting calculations, providing new insight in the folding process, can be performed 
without resorting to heavy numerical work. 



IV. COMMENTS AND CONCLUSIONS 



This paper is mainly devoted to the presentation and analysis of a new model hamiltonian to be used in thermo- 
dynamical and dynamical studies of the folding process. The various contributions to the hamiltonian (p3|), and their 
relation to phenomenology, have been thoroughly discussed, together with the approximations introduced. 

Then we conce|tttrated on the local part of the hamiltonian, and applied it to the study of a small (73-residues long) 
synthetic proteinEj, designed to produce a four-helix bundle. 
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Even if this part of the model does not contain information on the spatial structure, it takes into account the 
sequence, so we expect that it can provide relevant information on the secondary structure composition of the native 
state. To prove that this is indeed the case, we studied the local hamiltonian minima for the protein considered, and 
then switched on the non-local interactions as a small perturbation, thus superimposing the tertiary structure to the 
existing secondary one. 

We found that the four-helices ground-state is correctly made up of a-helices, placed in the same positions along 
the chain as they are found in the experimental native state. Encouraged by this result, we studied the best spatial 
configuration that the four helices we obtained would assume, due to the non-local mutual interactions. Again we 
obtained a positive result: bundle-like configurations are indeed the ground-state of the model, even if the latter 
appears to be degenerate, due to the small number of helices involved and to the highly simplified expression of the 
interaction hamiltonian H^j. It is very likely that the degeneracy would be partially or completely removed by the 
introduction of an interaction potential depending not only on the distance between helices, but also on the mutual 
orientation of both the hydrophobic moments and of the helices themselves. 

Finally, resorting to some simplifying assumptions, we showed that the partition function of the simplified, local 
model can be evaluated almost completely in analytic way. 

A detailed study of natural proteins with the above approach is left to future work. As already mentioned, we also 
leave to future efforts the refinements regarding, for istance, the explicit expressions for P, appearing in Eq. (|2C|), and 
for the charges (p^. In this paper they have been fixed in an arbitrary, though reasonable, way just to perform some 
tests on the model. The same holds true for assumption (^J), whose validity could depend on the sequence considered 
and requires further analysis. 

Coming to the partition function for the complete model, a reasonable goal is to perform exact, analytic integration 
on some of the variables (for instance, Wi, Zi, li), thus providing an effective interaction potential among the others; 
then, one could resort to numerical simulations. The latter could be approached, for instance, putting the B.^ on a 
lattice: in this way one could have a true mapping of real proteins on lattice models, and the approximations induced 
by the lattice could be better controlled in their relationship to protein geometry. 

The fact that the known native state of a real protein can be mapped onto a model configuration entails also that 
the study of the inverse folding problem can greatly benefit from this new approach. 

Finally, it is easy to provide a coarse-grained dynamics for the protein through the variables used in the model, 
and our future efforts will be dedicated also to this line of research. 
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APPENDIX A: 



Curvature and torsion of the curve r(s) are defined: 

I r A f I 



r A r • r 

If A 



(Al) 



Performing the calculations for the curve in Eq.(^, and keeping only the leading terms in the limit A — > 0, wc find: 



f A 



J2 (&»h» + 6»+ih»+i) -fc?+ih?+i 
,i=i L 

X! ( + A h,+i) + (6,h, -I- 6i+ih,+i) A (&,h, + 6,+ih,+i) 



r /\r ■ r 



6t+i(h.+i Ah,+i] 
=° ^ [fo^h, Ah^■h^ + {b, - h+i) \&gl{\l^ A h.+i)(-h. -f h.+i)- 
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2gi{bihi + b^+ihi+i) A hi • hi+i + 3gi(6jhi + bi+ihi+i) A • h^+i + 

In the above equations, dots as well as greek apices indicate derivative with respect to s, and we have introduced 
ho = hAT^+i = 0. 

Now we observe that the above quantities, in the limit considered, are made up by terms proportional to powers of 
&i, that are different from zero in the region Si_i < s < s^, and terms that live in the interfaces s = Si between helices, 
namely those containing the product 6^6^+1, or the delta function g^. These terms are mutually "orthogonal", in the 
sense their product is zero. This fact allows strong simplifications, because one can perform algebraic manipulations 
at fixed s (this is allowed, since we are not differentiating), and consider only those terms which are different from 
zero for that particular value of s. For instance, the curvature becomes: 

~t\ |h, + h,+i|3 J 

where •& is Heaviside's function, is the curvature of the i-th helix: 

I h» A h» I 



and the delta-like functions 



h. |3 



1 , if s ^ Si 
, otherwise 



have been introduced to single out those terms that live at the interfaces s^. Obviously, the term containing S{s — Si) 
is the only one to take into account, which leads to the result in Eq.(||). 



APPENDIX B: 



In order to study the relationship between geometrical quantities and (</), ip) angles, we observe that a protein may 
be built up by performing a sequence of rotations and translations of the peptide plane C^C NC^^i- 

If L is the segment joining to successive C", and if we label j the peptide plane between C" and Cj^i, we have 
that the position of is given, in the reference frame of peptide plane number 0, by: 



N-l 



1=1 



L 



R{(bui>i)L+ ••• +i?(,/)i,Vi)----R(0jv-i,?/'jv-i)L . (Bl) 



Following RamachandraiiE3, we associate to each peptide plane j a reference frame (xj , , kj ) , such that the origin 
sits on C", Yj — Lj and Xj lies on the plane, with Xj • CO > 0. 

Now we consider two successive planes, labelled and 1, and take </>, to be the unit vectors, expressed in reference 
frame 0, of the rotation axes iVCf and Cf C", when (p — — (according to the standard conventions, this corresponds 
to having CN, NCf, C^C in "cis" configuration, lying on the same plane with CN ■ CfC < 0). 

Given frame 0, the sequence of rotations to perform in order to obtain frame 1 is the following: first one rotates 
by an angle tt about the y axis, then of an angle 9 — —tt + Cg — — (tt— | a | + | C I + I I) about the z axis, 

to recover the standard cj) — ip — configuration. Here a = NC^C, (3 = C'Cf C^, and C = C^CfN. Then we can 
perform the ^/'-rotation, and successively the 0-one, obtaining: 

y.\^l^l[R{<t>)R{iP)R{e^)R{^y)]ba , (B2) 

where all the rotation axes are expressed in reference frame 0, (X{,X2,X3) — (xj,yj,Zj), and 

[i?(j7)]p, = cos(??),5p, + (1 - cos(77))77P77« - sin(77)77'-£'-f« (B3) 
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(sum over repeated indices is understood; rf are the cosine directors of rotation vector rj). 

The above product of rotations is equivalent to a single rotatioEca R{ijj{4>, ip)) with the argument u){4>, ■0) satisfying: 



c = c{uj) = k {s{'tp)c{(f>) + s{(j))c{tp)) , 

s = s{uj) = k [s(?/')s(0) - c((p)c{ip)] - K A (s(?/') A s((/))) + k A (s(i/')c((/)) 
where, for any argument 77, 0(77) = cos(77/2), 8(77) = f)sm(ri/2); while 



(B4) 
(B5) 



K, 



(-sin(-),cos(-),0) 



^=(sin(|C|),cos(|C|),0) , 
•0 = (sin(| a I - I C l),-cos(| a 



Cl),0) , 



The above equations specify completely the rotation matrices appearing in Eq.( ]B2| ). 

Now we have to relate the parameters identifying a helix, to the repeating value of the angle of rotation a;(0, ^p)■ 
Let us take three successive C" — C" segments, namely L, L', L" , and put ourselves in reference frame of segment 
L'. We have L = '(/')L' and L" = V'jL' , whence 



L — L{2{siS2 + CS3), 
L' = L{0,1,0} , 



+ 2S2, 2(S2S3 - CSl)} , 



L" ^ L{2{siS2 - CS3), c^-s^ + 2sl 2(s2S3 + csi)} , 

where c — c(a;), s = sin(ct;/2), Si — LOi sin(tL'/2) (si are the components of the vector s(a;)). 
Now we ask that the helix axis 63 satisfies the following requirements: 



L • 63 = L' • 63 



■ 63 



L' 



63 



The axis orientation is fixed by: 

L • 63 > , 

and, to distinguish between left- and right-handed helices, we assume that: 



sgn(h A L' • 63) = 



+1, 
-1, 



if right-handed 
if left-handed. 



(B6) 
(B7) 

(B8) 



We need also to relate the lenght of the helix arc, corresponding to one peptide segment L of the chain, to the angle 
of rotation at each site. From the definition of u (Eq. (^)), we have that As = 1 corresponds to an arc of lenght L; 
hence u is the rotation about the helix axis corresponding to a displacement of one peptide unit along the chain. This 
angle must be equal to the projection of lu on the plane perpendicular to 63, namely 



cos(m) 



63 A (LA 63) 63 A (L' A 63) 



Equations (B6)-(B8) yield: 



L A 63 



63 



L' A 63 



(B9) 



(BIO) 



where a = ±1. Observing that u e [0, 27r] {uj G [— tt, tt] would not be correct, as c{lu) in Eq. (B4) can also be negative), 
we always have s > 0, hence 



63 



(Bll) 



From Eq.(B9) we finally get: 



cos(m) = cos(w) 



To establish the sign a and the explicit expression of u, we use Eqs.(B7),(B8), and find that the product S2C 
determines the handedness of the helix (right-handed if positive), while a coincides with sgn{s2) whenever S2 is 
different from zero. The analysis of the case S2 = lead us to the following general definition: 
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u = a 



'UJ' 
-TT . 



) > 



(B12) 



where {xj denotes the maximum integer < x, and cr{x) — 1 if a; > 0, <j{x) = — 1 otherwise. With the above position we 
have u £ [— 7r,7r], with positive vahies corresponding to right-handedness. Recalhng definition (|^) we can also write: 



W = TT + (7(52) {uj — Tt) 

Now we can derive a and h, fi'om Eq.(Q) and the condition 

ahu — li ■ 63 ~ L 

and hence, using once more Eq.M), curvature and torsion. The former can be expressed as 



(B13) 



(B14) 



L 



while torsion is given by 



1 



(B15) 



N ow w e recall that lu depends on through Eqs.(B4) and (B5), where cos(^) and a)sin(^) appear. From 

Eq.(B15) and the relation cos ^ =| cos ^ |, using the explicit expressions of the vectors appearing in Eq.(B5), wc find 
two formulas relating u and r to the dihedral angles: 



u 

cos- 



S2 



where the quantities at the right hand side of the above equations can be explicitly written as 



. I a I J CI - I /3 k . + 
c(bj) = sm cos( ' ^ ' ^ ) sm(— y— ) + cos ■ 



_sm( )sm(^), 



S2 — sin cos( 



In analogous way we find for w, z: 



CI + l/31 



) cos( 



COS 



i^sin(^ 
2 ^ 2 



)cos(— — ). 



COS — — (t(s2)c(cj) 

k2| 



which is identical to Eq.(^8|), where an obvious definitions of coefficients a,b,c,d has been performed. 



APPENDIX C: 



In this appendix we discuss the validity of the two approximations Eqs. (|39| , |41| ) we introduced in section III. The 
first approximation is the most relevant one, since it involves the leading hamiltonian Hf. The parameters /xi, ^2, c 
have been chosen so that the exact and approximate functions have maxima at the same positions, and the leading 
order in the series expansion at those points are the same, provided that the overlap between the two gaussians is 
vanishing: 

exp[-/3(n,-l)^^^^-i^]«0 . 
We check that the approximation is globally good by a comparison of the two integrals: 
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I2 



-/3(«.-l) <"' Ji' _^^-/3(n.-l) 



dwi 

7rC3 1 

where y = /3(ni - l)7oCic§/2. 

The first one can be evaluated exphcitly: 

where I^^[y) are modified Bessel function of the first kind. 
A numerical evaluation of 



I2 



(CI) 



(C2) 



(C3) 



(C4) 



reveals that, for y > 0.24 it always holds e < 0.12, with e decreasing to zero as y increases. 

For the above reasons we consider expression Eq.(39) as a good approximation, with a word of caution: the value 
of its right and left member differ significantly at Wi = ci = (/ii + /^2)/2, the maximum of the quartic potential 
appearing in H^. We have in fact that the former goes as exp[— 2y], while the latter gives 2exp[— 8j/]. This difference 
may however be considered negligible when both of the above quantities are very small, namely, for long enough 
helices and/or low enough temperature 

Coming to the second approximation, we see that also in this case Taylor expansions near the maxima (^o,i^: ^a,v) 
coincide. To check the global behaviour, we proceed as before, evaluating: 



13 = 



dw^dk{e~>^"' - 1) , 



(C5) 



moving first to polar coordinates: 



r cos(6') 



r sin(6') 



then introducing z — cxp(— r^), and finally resorting to formula Eq. 5.1.40 in the Abramowitz-Stegun handboot 
whereby: 



^3 ri„T^Tr[Ei{b) - ln(6) - 7] , 

where 7 is the Euler constant, Ei indicates the exponential integral function, and b — (3niji6^. 
This is to be compared to: 



X4 = 



dw.dkifl - 1) = 4 



-^smh (-) 



(C6) 



(C7) 



Again, an estimate of e' = |1 — 24/I3I reveals that e' ^ both for 6 ^ and for b 00, with a maximum of e' = 0.27 
at 6 = 2.97. 
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FIG. 1. Contour lines of (^^._\^^q , showing the two minima in {w,z) plane; the variable w, on the horizontal axis ranges 
from 7r/3 to 57r/3, while z, on the vertical axis, goes from to 1. Lines are drawn at intervals of 1, in arbitrary units, in the 
range [0,5] (deeper regions are darker). The saddle point corresponds to an height of 2.7. The figure is obtained with the values 
ci = 2.73, C2 = 2.77, cs = 0.99, d = 0, C5 = 150, cs = 0.86, cr = 0.16, cs = 3.5. 

FIG. 2. The same as in Fig.(l), but with w, z expressed as functions of (horizontal axis) and t/) (vertical axis) through 
Eq.(14). Both and tl) range from — tt to n. The values of the parameters are the same as in Fig.(l). Notice the essentially 
correct position and shape of the a and /3 minima; the unphysical third minimum is an effect of the approximate symmetry of 
Eq.(14) under the transformation <-> i/;. 

FIG. 3. Plot of P{l,woa) (continuous line), P{l,Wop) (dotted line) and Q.lq(l) (dashed line, below) for the considered protein, 
calculate for a 7-residues-long helix centered at position 1 (on the horizontal axis); q and P have been defined in Eqs. (12, 26). 
Notice the big maxima of the dipole moment characterising the a helices. 

FIG. 4. Ground-state configuration most similar to a bundle, among those obtained by numerical calculations: pbun = 0.031, 
V = 0.072 (see text). 

FIG. 5. Ground-state configuration with the worst value of f3bun and V, among those obtained by numerical calculations: 
Pbun = 0.32, V = 0.64. 
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